1 Regression Trees

library(rpart)
library(rpart.plot)
library(ISLR)
library(knitr)
library(caret)
library(MASS)

Denote by \(SS_T=\sum(y_i-\bar{y})^2\) the sum of squares for the node \(T\).

Regression trees use a splitting criterion different from the one used for classification. In the anova method the splitting criterion is \[SS_T - (SS_L + SS_R),\] where \(SS_L\) and \(SS_R\) are the sums of squares for the left and right sons of \(T\), respectively.
The anova method leads to regression trees; it is the default method when the target variable is a simple numeric vector, i.e., not a factor, matrix, or survival object.
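As a small illustration of this criterion, the improvement of a candidate split can be computed by hand. The sketch below uses the built-in mtcars data (splitting mpg on wt), which stands in for any numeric response and split variable:

```r
ss <- function(y) sum((y - mean(y))^2)   # sum of squares around the node mean

y <- mtcars$mpg   # response
x <- mtcars$wt    # candidate split variable

# improvement of splitting the node at x < cut into left and right sons
split_improvement <- function(cut) {
  ss(y) - (ss(y[x < cut]) + ss(y[x >= cut]))
}

# evaluate midpoints between consecutive distinct x values
xs   <- sort(unique(x))
cuts <- head(xs, -1) + diff(xs) / 2
best <- cuts[which.max(sapply(cuts, split_improvement))]
best  # best least-squares cut (rpart additionally enforces minbucket, cp, etc.)
```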

1.1 Example from Section 8.1.1 of ISLR: Predicting Baseball Players’ Salaries

Use the Hitters data from the ISLR library for a simple example

data("Hitters")
# Remove incomplete cases 
Hitters <- na.omit(Hitters)
dim(Hitters)
## [1] 263  20
kable(head(Hitters,3))
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
-Alan Ashby 315 81 7 24 38 39 14 3449 835 69 321 414 375 N W 632 43 10 475 N
-Alvin Davis 479 130 18 66 72 76 3 1624 457 63 224 266 263 A W 880 82 14 480 A
-Andre Dawson 496 141 20 65 78 37 11 5628 1575 225 828 838 354 N E 200 11 3 500 N

Predict a baseball player’s Salary based on the number of Years spent in the major league, and the number of Hits in the previous year.
Salary is measured in thousands of dollars.

Fit the model

salaryFit <- rpart(Salary~Hits+Years, data=Hitters)

The regression tree can be visualized by prp().

prp(salaryFit,extra=101, # display the number of observations that fall in the node
    branch=.5, # change angle of branch lines
    shadow.col="gray", # shadows under the leaves
    branch.lty=3, # draw branches using dotted lines
    split.cex=1.2, # make the split text larger than the node text
    split.prefix="is ", # put "is " before split text
    split.suffix="?", # put "?" after split text
    split.box.col="lightgray", # lightgray split boxes (default is white)
    split.border.col="darkgray", # darkgray border on split boxes
    split.round=.5,
    nn=TRUE) # display the node numbers, default is FALSE

Interpret the tree.
Which factors are the most important for size of compensation? Are there any questionable results of the tree model?

# Questionable: with under 4.5 years and fewer than 42 hits, the predicted salary is higher than with under 3.5 years and any number of hits.
# It is also higher than with under 4.5 years and a larger number of hits.

A tree with terminal nodes that are too small is unreliable and misleading.
This tree is too deep.

Analyze the depth of the tree. How many nodes are necessary to keep the tree stable and still accurate?

The decision about the appropriate depth of the tree is made based on the parameter CP (the complexity parameter) and the error columns of the cptable.

A common rule of thumb (the 1-SE rule) is to prune to the smallest tree whose xerror lies within one standard error (xstd) of the minimum xerror; comparing rel error + xstd against xerror, as done below, gives a similar rough guide.
Check the output cptable.

salaryFit$cptable
##           CP nsplit rel error    xerror      xstd
## 1 0.24674996      0 1.0000000 1.0177861 0.1392688
## 2 0.18990577      1 0.7532500 0.7723039 0.1283051
## 3 0.02052199      2 0.5633443 0.5963625 0.1040882
## 4 0.01428090      3 0.5428223 0.6416193 0.1098044
## 5 0.01162544      4 0.5285414 0.6554035 0.1108470
## 6 0.01087042      5 0.5169159 0.6633079 0.1109284
## 7 0.01026659      8 0.4843047 0.6566277 0.1110065
## 8 0.01000000     10 0.4637715 0.6574350 0.1111501
cbind(salaryFit$cptable[,3]+salaryFit$cptable[,5],salaryFit$cptable[,4])
##        [,1]      [,2]
## 1 1.1392688 1.0177861
## 2 0.8815552 0.7723039
## 3 0.6674325 0.5963625
## 4 0.6526267 0.6416193
## 5 0.6393884 0.6554035
## 6 0.6278444 0.6633079
## 7 0.5953112 0.6566277
## 8 0.5749216 0.6574350

The rule of thumb suggests pruning the tree at 2 splits, i.e., at CP = 0.020522 (row 3 of the cptable).
Show this tree.

prunedTree <- prune(salaryFit, cp = salaryFit$cptable[3,1])
printcp(prunedTree)
## 
## Regression tree:
## rpart(formula = Salary ~ Hits + Years, data = Hitters)
## 
## Variables actually used in tree construction:
## [1] Hits  Years
## 
## Root node error: 53319113/263 = 202734
## 
## n= 263 
## 
##         CP nsplit rel error  xerror    xstd
## 1 0.246750      0   1.00000 1.01779 0.13927
## 2 0.189906      1   0.75325 0.77230 0.12831
## 3 0.020522      2   0.56334 0.59636 0.10409

The output of printcp() contains important information about the number of variables participating in splits.
Extract that number.

  • First, look at the field of the rpart object called frame. Each row of that data frame contains information about one node. Column var contains the name of the variable involved in the split. Terminal nodes are denoted as <leaf>.
  • To find all variables involved in splits, select the rows that are not leaves and return their names.
tree_frame <- prunedTree$frame
tree_leaves_idx <- tree_frame$var == "<leaf>"
(used_variables <- unique(tree_frame$var[!tree_leaves_idx]))
## [1] "Years" "Hits"

It is also possible to extract the importance of variables, measured by “goodness of split”: how much the loss decreased each time the variable participated in a split, as a primary or a surrogate variable.

prunedTree$variable.importance
##    Years     Hits 
## 13644470 10564157

Plot the pruned tree.

prp(prunedTree,extra=101, # display the number of observations that fall in the node
    branch=.5, # change angle of branch lines
    shadow.col="gray", # shadows under the leaves
    branch.lty=3, # draw branches using dotted lines
    split.cex=1.2, # make the split text larger than the node text
    split.prefix="is ", # put "is " before split text
    split.suffix="?", # put "?" after split text
    split.box.col="lightgray", # lightgray split boxes (default is white)
    split.border.col="darkgray", # darkgray border on split boxes
    split.round=.5,
    nn=TRUE) # display the node numbers, default is FALSE

This tree is smaller: it separates players by experience, then further splits more experienced group by number of hits.

Another way of pruning the tree is to find the best CP level by xerror alone: find the level where xerror bottoms out.

printcp(salaryFit)
## 
## Regression tree:
## rpart(formula = Salary ~ Hits + Years, data = Hitters)
## 
## Variables actually used in tree construction:
## [1] Hits  Years
## 
## Root node error: 53319113/263 = 202734
## 
## n= 263 
## 
##         CP nsplit rel error  xerror    xstd
## 1 0.246750      0   1.00000 1.01779 0.13927
## 2 0.189906      1   0.75325 0.77230 0.12831
## 3 0.020522      2   0.56334 0.59636 0.10409
## 4 0.014281      3   0.54282 0.64162 0.10980
## 5 0.011625      4   0.52854 0.65540 0.11085
## 6 0.010870      5   0.51692 0.66331 0.11093
## 7 0.010267      8   0.48430 0.65663 0.11101
## 8 0.010000     10   0.46377 0.65744 0.11115
plotcp(salaryFit)
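plotcp() draws the cross-validated error with a one-standard-error reference line; the corresponding 1-SE selection (the smallest tree whose xerror is within one xstd of the minimum) can also be coded directly from the cptable. A sketch, assuming salaryFit from above (cp1SE is a helper name introduced here):

```r
# Pick the CP value by the 1-SE rule: the smallest tree whose xerror is
# within one standard error of the minimum xerror.
cp1SE <- function(fit) {
  tab <- fit$cptable
  i.min  <- which.min(tab[, "xerror"])
  thresh <- tab[i.min, "xerror"] + tab[i.min, "xstd"]
  i.1se  <- min(which(tab[, "xerror"] <= thresh))  # leftmost qualifying row
  tab[i.1se, "CP"]
}
prunedTree1SE <- prune(salaryFit, cp = cp1SE(salaryFit))
```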

(best.CP = salaryFit$cptable[which.min(salaryFit$cptable[,"xerror"]),"CP"])
## [1] 0.02052199
prunedTree <- prune(salaryFit, cp = best.CP)
printcp(prunedTree)
## 
## Regression tree:
## rpart(formula = Salary ~ Hits + Years, data = Hitters)
## 
## Variables actually used in tree construction:
## [1] Hits  Years
## 
## Root node error: 53319113/263 = 202734
## 
## n= 263 
## 
##         CP nsplit rel error  xerror    xstd
## 1 0.246750      0   1.00000 1.01779 0.13927
## 2 0.189906      1   0.75325 0.77230 0.12831
## 3 0.020522      2   0.56334 0.59636 0.10409
prp(prunedTree,extra=101, # display the number of observations that fall in the node
    branch=.5, # change angle of branch lines
    shadow.col="gray", # shadows under the leaves
    branch.lty=3, # draw branches using dotted lines
    split.cex=1.2, # make the split text larger than the node text
    split.prefix="is ", # put "is " before split text
    split.suffix="?", # put "?" after split text
    split.box.col="lightgray", # lightgray split boxes (default is white)
    split.border.col="darkgray", # darkgray border on split boxes
    split.round=.5,
    nn=TRUE) # display the node numbers, default is FALSE

After the tree is pruned, forecasting is done by assigning to each observation the mean output (salary) of the leaf it falls into.
This tree seems oversimplified.
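This can be verified directly from the rpart object's frame/where fields; a sketch on the built-in mtcars data, so it runs without ISLR:

```r
library(rpart)

fit <- rpart(mpg ~ wt + hp, data = mtcars, minsplit = 10)

# fit$where maps each observation to a row of fit$frame (its leaf);
# the anova prediction for a leaf is the mean response of that leaf
leaf_means <- tapply(mtcars$mpg, fit$where, mean)
all.equal(unname(leaf_means[as.character(fit$where)]),
          unname(predict(fit)))  # TRUE
```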

Look at the residuals from this model, just as with a regular linear regression fit

plot(predict(prunedTree), jitter(resid(prunedTree)))
temp <- prunedTree$frame[prunedTree$frame$var == '<leaf>',]
axis(3, at = temp$yval, as.character(row.names(temp)))
mtext('leaf number', side = 3, line = 3)
abline(h = 0, lty = 2)

Note that node 6 contains many players and their residuals are fairly concentrated.
But leaf 2 has outliers.

Find the sum of squared prediction errors (labelled MSE in the code below).

(prunedTree.MSE<-sum((Hitters$Salary-predict(prunedTree))^2))
## [1] 30037016
sum(resid(prunedTree)^2)
## [1] 30037016

Try to grow a tree without outliers.
To do this, control the minimum size of the terminal buckets with the minbucket parameter.

salaryFitControlled <- rpart(Salary~Hits+Years, data=Hitters,minbucket=15)
prp(salaryFitControlled,extra=101, # display the number of observations that fall in the node
    branch=.5, # change angle of branch lines
    shadow.col="gray", # shadows under the leaves
    branch.lty=3, # draw branches using dotted lines
    split.cex=1.2, # make the split text larger than the node text
    split.prefix="is ", # put "is " before split text
    split.suffix="?", # put "?" after split text
    split.box.col="lightgray", # lightgray split boxes (default is white)
    split.border.col="darkgray", # darkgray border on split boxes
    split.round=.5,
    nn=TRUE) # display the node numbers, default is FALSE

(salaryFitControlled.MSE<-sum((Hitters$Salary-predict(salaryFitControlled))^2))
## [1] 27915376
sum(resid(salaryFitControlled)^2)
## [1] 27915376

This tree has a lower error than the pruned tree and a more meaningful distribution of observations across nodes.

Answer the following questions using the size-controlled decision tree model:

  1. What is the maximum expected compensation for a player who makes around 100 hits?
  2. If you are a manager of a player who has been in the league for 7 years and makes about 100 hits, how would you set a target for the player that may lead him to an increase in compensation by:
  • $300,000-350,000?
  • $500,000 or more?
  3. What is the probability of making $832,000, given that the player has been in the league for more than 4.5 years and makes more than 118 hits?

(Skipped Code)

The answers, with question 3 worked out numerically below:

#1. Node #13: $518k
#2. 
# * Increase hits above 118
# * Increase hits beyond 142 and wait for a year
#3.  P(H<142|Y>=4.5,H>=118)=P(118<=H<142,Y>=4.5)/P(Y>=4.5,H>=118)=0.12/0.32

0.12/0.32
## [1] 0.375

Use library(caret) to build a tree.

set.seed(0)
ctrl <- trainControl(method = "cv", number = 10)
tree.slice <- train(Salary~Hits+Years, data=Hitters,
                   method = 'rpart', trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
tree.slice$results
##           cp     RMSE  Rsquared      MAE    RMSESD RsquaredSD    MAESD
## 1 0.02052199 350.4053 0.4366677 239.6130  99.24844  0.1845119 50.64152
## 2 0.18990577 369.2794 0.3555307 259.0561 101.99208  0.1557585 53.84690
## 3 0.24674996 441.4993 0.1217775 333.6233  93.62968  0.1564601 51.12298
prp(tree.slice$finalModel,extra=101, # display the number of observations that fall in the node
    branch=.5, # change angle of branch lines
    shadow.col="gray", # shadows under the leaves
    branch.lty=3, # draw branches using dotted lines
    split.cex=1.2, # make the split text larger than the node text
    split.prefix="is ", # put "is " before split text
    split.suffix="?", # put "?" after split text
    split.box.col="lightgray", # lightgray split boxes (default is white)
    split.border.col="darkgray", # darkgray border on split boxes
    split.round=.5,
    nn=TRUE) # display the node numbers, default is FALSE

Calculate MSE.

(caretTree.MSE<-sum(residuals(tree.slice$finalModel)^2))
## [1] 30037016

Compare the MSE values from 2 packages:

c(rpart=prunedTree.MSE,caret.rpart=caretTree.MSE)
##       rpart caret.rpart 
##    30037016    30037016
matplot(1:length(resid(prunedTree)),
        cbind(resid(prunedTree),
              residuals(tree.slice$finalModel),
              resid(salaryFitControlled)),
        pch=c(15,16,16),col=c("black","red","orange"),
        ylab="Residuals",xlab="Index")
legend("topright",legend=c("rpartBestCP","caret","controlledSize"),pch=c(15,16,16),col=c("black","red","orange"))

1.2 Examples from introduction to the package

The following examples are from the long introduction to rpart.

1.3 Introduction to main functions: Cars

The dataset car90 contains a collection of variables from the April 1990 issue of Consumer Reports; it has 34 variables on 111 cars.
The tire size and model name variables are excluded because they are factors with a very large number of levels, which creates a very long printout; rim size is excluded because it is too good a predictor of price and leads to a less interesting illustration. (Tiny cars are cheaper and have small rims.)

cars <- car90[, -match(c("Rim", "Tires", "Model2"), names(car90))]
head(cars)
##               Country Disp Disp2 Eng.Rev Front.Hd Frt.Leg.Room Frt.Shld
## Acura Integra   Japan  112   1.8    2935      3.5         41.5     53.0
## Acura Legend    Japan  163   2.7    2505      2.0         41.5     55.5
## Audi 100      Germany  141   2.3    2775      2.5         41.5     56.5
## Audi 80       Germany  121   2.0    2835      4.0         42.0     52.5
## BMW 325i      Germany  152   2.5    2625      2.0         42.0     52.0
## BMW 535i      Germany  209   3.5    2285      3.0         42.0     54.5
##               Gear.Ratio Gear2  HP HP.revs Height Length Luggage Mileage Price
## Acura Integra       3.26  3.21 130    6000   47.5    177      16      NA 11950
## Acura Legend        2.95  3.02 160    5900   50.0    191      14      20 24760
## Audi 100            3.27  3.25 130    5500   51.5    193      17      NA 26900
## Audi 80             3.25  3.25 108    5300   50.5    176      10      27 18900
## BMW 325i            3.02  2.99 168    5800   49.5    175      12      NA 24650
## BMW 535i            2.80  2.85 208    5700   51.0    186      12      NA 33200
##               Rear.Hd Rear.Seating RearShld Reliability Sratio.m Sratio.p
## Acura Integra     1.5         26.5     52.0 Much better       NA     0.86
## Acura Legend      2.0         28.5     55.5 Much better       NA     0.96
## Audi 100          3.0         31.0     55.0        <NA>       NA     0.97
## Audi 80           1.0         28.0     52.0        <NA>       NA     0.71
## BMW 325i          1.0         25.5     51.5      better       NA     0.88
## BMW 535i          2.5         27.0     55.5        <NA>       NA     0.78
##               Steering Tank Trans1 Trans2 Turning    Type Weight Wheel.base
## Acura Integra    power 13.2  man.5 auto.4      37   Small   2700        102
## Acura Legend     power 18.0  man.5 auto.4      42  Medium   3265        109
## Audi 100         power 21.1  man.5 auto.3      39  Medium   2935        106
## Audi 80          power 15.9  man.5 auto.3      35 Compact   2670        100
## BMW 325i         power 16.4  man.5 auto.4      35 Compact   2895        101
## BMW 535i         power 21.1  man.5 auto.4      39  Medium   3640        109
##               Width
## Acura Integra    67
## Acura Legend     69
## Audi 100         71
## Audi 80          67
## BMW 325i         65
## BMW 535i         69
carfit <- rpart(Price/1000 ~ ., data=cars)
carfit
## n=105 (6 observations deleted due to missingness)
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 105 7118.26700 15.805220  
##    2) Disp< 156 70 1491.62800 11.855870  
##      4) Country=Brazil,Japan,Japan/USA,Korea,Mexico,USA 58  421.21470 10.318470  
##        8) Type=Small 21   50.30983  7.629048 *
##        9) Type=Compact,Medium,Sporty,Van 37  132.80330 11.844890 *
##      5) Country=France,Germany,Sweden 12  270.72330 19.286670 *
##    3) Disp>=156 35 2351.19600 23.703910  
##      6) HP.revs< 5550 24  980.27790 20.388710  
##       12) Disp< 267.5 16  395.99670 17.820060 *
##       13) Disp>=267.5 8  267.58000 25.526000 *
##      7) HP.revs>=5550 11  531.63680 30.937090 *
prp(carfit,extra=101, # display the number of observations that fall in the node
    branch=.5, # change angle of branch lines
    shadow.col="gray", # shadows under the leaves
    branch.lty=3, # draw branches using dotted lines
    split.cex=1.2, # make the split text larger than the node text
    split.prefix="is ", # put "is " before split text
    split.suffix="?", # put "?" after split text
    split.box.col="lightgray", # lightgray split boxes (default is white)
    split.border.col="darkgray", # darkgray border on split boxes
    split.round=.5,
    nn=TRUE) # display the node numbers, default is FALSE

printcp(carfit)
## 
## Regression tree:
## rpart(formula = Price/1000 ~ ., data = cars)
## 
## Variables actually used in tree construction:
## [1] Country Disp    HP.revs Type   
## 
## Root node error: 7118.3/105 = 67.793
## 
## n=105 (6 observations deleted due to missingness)
## 
##         CP nsplit rel error  xerror    xstd
## 1 0.460146      0   1.00000 1.00373 0.16091
## 2 0.117905      1   0.53985 0.73010 0.11672
## 3 0.112343      2   0.42195 0.72191 0.11616
## 4 0.044491      3   0.30961 0.65157 0.12016
## 5 0.033449      4   0.26511 0.62365 0.12598
## 6 0.010000      5   0.23166 0.61966 0.13556
  • The relative error is scaled so that the error of the root node (no splits) equals 1
  • The 1-SE rule of thumb (choose the smallest tree whose xerror is within one xstd of the minimum xerror) selects a small tree here, with 2 splits; the exact choice varies with the cross-validation run
  • In this case the default cp value of .01 may have over-pruned the tree, since the cross-validated error is barely at a minimum. A rerun with the cp threshold at .001 gave little change, however.
summary(carfit, cp = 0.1)
## Call:
## rpart(formula = Price/1000 ~ ., data = cars)
##   n=105 (6 observations deleted due to missingness)
## 
##           CP nsplit rel error    xerror      xstd
## 1 0.46014608      0 1.0000000 1.0037274 0.1609054
## 2 0.11790530      1 0.5398539 0.7301001 0.1167165
## 3 0.11234341      2 0.4219486 0.7219094 0.1161613
## 4 0.04449133      3 0.3096052 0.6515713 0.1201572
## 5 0.03344936      4 0.2651139 0.6236470 0.1259838
## 6 0.01000000      5 0.2316645 0.6196641 0.1355606
## 
## Variable importance
##       Disp      Disp2     Weight       Tank         HP Wheel.base    Country 
##         19         18         13         12         11          9          6 
##    HP.revs      Gear2   Front.Hd       Type     Length   Steering      Width 
##          4          3          1          1          1          1          1 
## 
## Node number 1: 105 observations,    complexity param=0.4601461
##   mean=15.80522, MSE=67.79302 
##   left son=2 (70 obs) right son=3 (35 obs)
##   Primary splits:
##       Disp   < 156    to the left,  improve=0.4601461, (0 missing)
##       Disp2  < 2.55   to the left,  improve=0.4601461, (0 missing)
##       HP     < 154    to the left,  improve=0.4548845, (0 missing)
##       Tank   < 17.8   to the left,  improve=0.4431325, (0 missing)
##       Weight < 2890   to the left,  improve=0.3912428, (0 missing)
##   Surrogate splits:
##       Disp2      < 2.55   to the left,  agree=1.000, adj=1.000, (0 split)
##       Weight     < 3095   to the left,  agree=0.914, adj=0.743, (0 split)
##       HP         < 139    to the left,  agree=0.895, adj=0.686, (0 split)
##       Tank       < 17.95  to the left,  agree=0.895, adj=0.686, (0 split)
##       Wheel.base < 105.5  to the left,  agree=0.857, adj=0.571, (0 split)
## 
## Node number 2: 70 observations,    complexity param=0.1123434
##   mean=11.85587, MSE=21.30898 
##   left son=4 (58 obs) right son=5 (12 obs)
##   Primary splits:
##       Country splits as  L-RRLLLLRL, improve=0.5361191, (0 missing)
##       Tank    < 15.65  to the left,  improve=0.3805426, (0 missing)
##       Weight  < 2567.5 to the left,  improve=0.3691043, (0 missing)
##       Type    splits as  R-RLRR,     improve=0.3649737, (0 missing)
##       HP      < 105.5  to the left,  improve=0.3577873, (0 missing)
##   Surrogate splits:
##       Tank         < 17.8   to the left,  agree=0.857, adj=0.167, (0 split)
##       Rear.Seating < 28.75  to the left,  agree=0.843, adj=0.083, (0 split)
## 
## Node number 3: 35 observations,    complexity param=0.1179053
##   mean=23.70391, MSE=67.17703 
##   left son=6 (24 obs) right son=7 (11 obs)
##   Primary splits:
##       HP.revs  < 5550   to the left,  improve=0.3569594, (0 missing)
##       HP       < 173.5  to the left,  improve=0.3146835, (0 missing)
##       Frt.Shld < 57.75  to the right, improve=0.2554028, (0 missing)
##       Front.Hd < 3.25   to the right, improve=0.2278477, (0 missing)
##       Type     splits as  RLR-RL,     improve=0.2178764, (0 missing)
##   Surrogate splits:
##       Country  splits as  -R-RR----L, agree=0.886, adj=0.636, (0 split)
##       Gear2    < 2.735  to the left,  agree=0.886, adj=0.636, (0 split)
##       Disp     < 172.5  to the right, agree=0.800, adj=0.364, (0 split)
##       Disp2    < 2.85   to the right, agree=0.800, adj=0.364, (0 split)
##       Front.Hd < 3.25   to the right, agree=0.800, adj=0.364, (0 split)
## 
## Node number 4: 58 observations
##   mean=10.31847, MSE=7.262322 
## 
## Node number 5: 12 observations
##   mean=19.28667, MSE=22.56027 
## 
## Node number 6: 24 observations
##   mean=20.38871, MSE=40.84491 
## 
## Node number 7: 11 observations
##   mean=30.93709, MSE=48.33062
  • The improvement listed is the percent change in SS for this split, i.e., \(1-\frac{SS_L+SS_R}{SS_T}\), which is the gain in \(R^2\) for the fit
  • The data set has displacement of the engine in both cubic inches (Disp) and liters (Disp2). The second is a perfect surrogate split for the first, obviously.

Surrogate splits may be used when some features of the input vector are missing and the prediction procedure would otherwise get stuck at a node. In addition to the best “primary” split, every tree node may also be split on one or more other variables with nearly the same result.

  • The weight and displacement are very closely related, as shown by the surrogate split agreement of 91%
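The effect of surrogates is easy to demonstrate by blanking out a value; a sketch assuming carfit and cars from the chunks above:

```r
# The primary split at the root is Disp < 156; blank Disp out and rpart
# falls back on the surrogate Disp2 < 2.55 (agreement 1.000 above).
newcar <- cars[1, ]      # Acura Integra
newcar$Disp <- NA        # primary split variable now missing
c(with_Disp    = predict(carfit, newdata = cars[1, ]),
  without_Disp = predict(carfit, newdata = newcar))  # identical predictions
```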

The following plot shows the (observed-expected) cost of cars versus the predicted cost of cars based on the nodes/leaves in which the cars landed.
There appears to be more variability in node 7 than in some of the other leaves.

plot(predict(carfit), jitter(resid(carfit)))
temp <- carfit$frame[carfit$frame$var == '<leaf>',]
axis(3, at = temp$yval, as.character(row.names(temp)))
mtext('leaf number', side = 3, line = 3)
abline(h = 0, lty = 2)

The following graphs show how \(R^2\) improves and how the X Relative error drops with growing number of splits.

par(mfrow=c(1,2))
rsq.rpart(carfit)
## 
## Regression tree:
## rpart(formula = Price/1000 ~ ., data = cars)
## 
## Variables actually used in tree construction:
## [1] Country Disp    HP.revs Type   
## 
## Root node error: 7118.3/105 = 67.793
## 
## n=105 (6 observations deleted due to missingness)
## 
##         CP nsplit rel error  xerror    xstd
## 1 0.460146      0   1.00000 1.00373 0.16091
## 2 0.117905      1   0.53985 0.73010 0.11672
## 3 0.112343      2   0.42195 0.72191 0.11616
## 4 0.044491      3   0.30961 0.65157 0.12016
## 5 0.033449      4   0.26511 0.62365 0.12598
## 6 0.010000      5   0.23166 0.61966 0.13556

par(mfrow=c(1,1))

1.4 Poisson regression: solder example

The Poisson splitting method attempts to extend rpart models to event rate data.
The model in this case is \(\lambda=f(x)\), where \(\lambda\) is an event rate and \(x\) is some set of predictors.

The solder data frame is a dataset with 900 observations, the results of an experiment varying 5 factors relevant to the wave-soldering procedure for mounting components on printed circuit boards.
The response variable, skips, is a count of how many solder skips appeared in a visual inspection.

The other variables are:

  1. Opening factor: amount of clearance around the mounting pad \((S<M<L)\)
  2. Solder factor: amount of solder used \((Thin<Thick)\)
  3. Mask factor: type of solder mask used (5 possible)
  4. PadType factor: Mounting pad used (10 possible)
  5. Panel factor: panel (1, 2 or 3) on the board being counted

In this call, the rpart.control options are modified:

  • maxcompete = 2 means that only 2 other competing splits are listed (default is 4);
  • cp = 0.004 means that a larger tree than the default will be built initially (the default cp is .01).
  • The \(y\) variable for Poisson partitioning may be a two column matrix containing the observation time in column 1 and the number of events in column 2, or it may be a vector of event counts alone
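The call below passes skips as a plain count vector; the two-column (time, events) form looks like the following sketch on synthetic data (the data frame sim and variable x are made up for illustration):

```r
library(rpart)

set.seed(42)
# hypothetical rate data: observation time plus an event count per subject
sim <- data.frame(time = rexp(300, rate = 1/10), x = runif(300))
sim$events <- rpois(300, lambda = sim$time * ifelse(sim$x > 0.5, 0.6, 0.1))

# two-column response: observation time in column 1, event count in column 2
ratefit <- rpart(cbind(time, events) ~ x, data = sim, method = "poisson")
ratefit   # yval is now an estimated event rate per unit of time
```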
sfit <- rpart(skips ~ Opening + Solder + Mask + PadType + Panel,data = solder, method ='poisson',
              control = rpart.control(cp = 0.004, maxcompete = 2))
printcp(sfit)
## 
## Rates regression tree:
## rpart(formula = skips ~ Opening + Solder + Mask + PadType + Panel, 
##     data = solder, method = "poisson", control = rpart.control(cp = 0.004, 
##         maxcompete = 2))
## 
## Variables actually used in tree construction:
## [1] Mask    Opening PadType Panel   Solder 
## 
## Root node error: 8788.2/900 = 9.7647
## 
## n= 900 
## 
##           CP nsplit rel error  xerror      xstd
## 1  0.3037885      0   1.00000 1.00270 0.0521095
## 2  0.1540759      1   0.69621 0.69973 0.0326983
## 3  0.1284762      2   0.54214 0.54514 0.0252752
## 4  0.0428654      3   0.41366 0.41711 0.0194107
## 5  0.0287271      4   0.37079 0.37616 0.0173988
## 6  0.0272254      5   0.34207 0.36314 0.0171739
## 7  0.0183899      6   0.31484 0.32327 0.0151030
## 8  0.0156026      7   0.29645 0.31862 0.0148515
## 9  0.0125093      8   0.28085 0.29398 0.0136167
## 10 0.0089906     10   0.25583 0.27232 0.0124124
## 11 0.0085255     11   0.24684 0.26565 0.0117998
## 12 0.0066697     13   0.22979 0.25594 0.0114715
## 13 0.0063814     14   0.22312 0.25094 0.0113376
## 14 0.0060658     15   0.21674 0.24973 0.0112412
## 15 0.0059467     16   0.21067 0.24438 0.0109590
## 16 0.0058737     17   0.20473 0.24440 0.0109517
## 17 0.0055449     18   0.19885 0.23646 0.0107538
## 18 0.0050760     19   0.19331 0.23148 0.0105358
## 19 0.0045152     20   0.18823 0.21829 0.0099479
## 20 0.0044702     21   0.18372 0.21508 0.0098090
## 21 0.0044302     22   0.17925 0.21459 0.0099189
## 22 0.0040000     23   0.17482 0.20869 0.0096200
prp(sfit,extra=101, # display the number of observations that fall in the node
    branch=.5, # change angle of branch lines
    shadow.col="gray", # shadows under the leaves
    branch.lty=3, # draw branches using dotted lines
    split.cex=1.2, # make the split text larger than the node text
    split.prefix="is ", # put "is " before split text
    split.suffix="?", # put "?" after split text
    split.box.col="lightgray", # lightgray split boxes (default is white)
    split.border.col="darkgray", # darkgray border on split boxes
    split.round=.5,
    nn=TRUE) # display the node numbers, default is FALSE

  • The response value is the expected event rate (with a time variable), or in this case the expected number of skips.
  • The deviance is the same as the null deviance (sometimes called the residual deviance) that you’d get when calculating a Poisson glm model for the given subset of data
summary(sfit, cp = 0.1)
## Call:
## rpart(formula = skips ~ Opening + Solder + Mask + PadType + Panel, 
##     data = solder, method = "poisson", control = rpart.control(cp = 0.004, 
##         maxcompete = 2))
##   n= 900 
## 
##             CP nsplit rel error    xerror        xstd
## 1  0.303788504      0 1.0000000 1.0027009 0.052109520
## 2  0.154075869      1 0.6962115 0.6997296 0.032698324
## 3  0.128476176      2 0.5421356 0.5451440 0.025275240
## 4  0.042865392      3 0.4136595 0.4171140 0.019410687
## 5  0.028727072      4 0.3707941 0.3761589 0.017398837
## 6  0.027225437      5 0.3420670 0.3631444 0.017173886
## 7  0.018389899      6 0.3148416 0.3232708 0.015103045
## 8  0.015602563      7 0.2964517 0.3186170 0.014851451
## 9  0.012509312      8 0.2808491 0.2939828 0.013616678
## 10 0.008990627     10 0.2558305 0.2723185 0.012412370
## 11 0.008525450     11 0.2468398 0.2656460 0.011799802
## 12 0.006669726     13 0.2297889 0.2559355 0.011471486
## 13 0.006381358     14 0.2231192 0.2509390 0.011337577
## 14 0.006065836     15 0.2167379 0.2497272 0.011241153
## 15 0.005946661     16 0.2106720 0.2443843 0.010959021
## 16 0.005873677     17 0.2047254 0.2444042 0.010951675
## 17 0.005544938     18 0.1988517 0.2364607 0.010753779
## 18 0.005075974     19 0.1933067 0.2314824 0.010535772
## 19 0.004515244     20 0.1882308 0.2182898 0.009947870
## 20 0.004470166     21 0.1837155 0.2150782 0.009808968
## 21 0.004430245     22 0.1792454 0.2145904 0.009918866
## 22 0.004000000     23 0.1748151 0.2086901 0.009619960
## 
## Variable importance
##    Mask Opening  Solder PadType   Panel 
##      38      36      16       9       1 
## 
## Node number 1: 900 observations,    complexity param=0.3037885
##   events=4977,  estimated rate=5.53 , mean deviance=9.764721 
##   left son=2 (600 obs) right son=3 (300 obs)
##   Primary splits:
##       Opening splits as  LLR,   improve=2669.769, (0 missing)
##       Mask    splits as  LLRLR, improve=2162.010, (0 missing)
##       Solder  splits as  LR,    improve=1168.401, (0 missing)
## 
## Node number 2: 600 observations,    complexity param=0.1284762
##   events=1531,  estimated rate=2.552564 , mean deviance=4.927685 
##   left son=4 (420 obs) right son=5 (180 obs)
##   Primary splits:
##       Mask    splits as  LLRLR, improve=1129.0820, (0 missing)
##       Opening splits as  LR-,   improve= 250.7655, (0 missing)
##       Solder  splits as  LR,    improve= 219.7641, (0 missing)
## 
## Node number 3: 300 observations,    complexity param=0.1540759
##   events=3446,  estimated rate=11.48308 , mean deviance=10.53956 
##   left son=6 (150 obs) right son=7 (150 obs)
##   Primary splits:
##       Mask    splits as  LLRRR,      improve=1354.0590, (0 missing)
##       Solder  splits as  LR,         improve= 976.9158, (0 missing)
##       PadType splits as  RRRRLLRLRL, improve= 313.2002, (0 missing)
##   Surrogate splits:
##       Solder splits as  LR, agree=0.6, adj=0.2, (0 split)
## 
## Node number 4: 420 observations
##   events=433,  estimated rate=1.032889 , mean deviance=2.081981 
## 
## Node number 5: 180 observations
##   events=1098,  estimated rate=6.099428 , mean deviance=5.294992 
## 
## Node number 6: 150 observations
##   events=680,  estimated rate=4.534533 , mean deviance=4.350953 
## 
## Node number 7: 150 observations
##   events=2766,  estimated rate=18.42446 , mean deviance=7.701124

Note the splitting criterion. The improvement is \(Deviance_{parent}-(Deviance_R+Deviance_L)\), which is the likelihood ratio test for comparing two Poisson samples
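This deviance improvement can be sketched by hand for the simple case where every observation contributes one unit of exposure time (a simplification: rpart's poisson method additionally shrinks the within-node rate estimates):

```r
# Poisson deviance of counts y around the node's estimated rate mean(y);
# the y*log(y/mu) term is taken as 0 when y == 0
pois_dev <- function(y) {
  mu <- mean(y)
  2 * sum(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
}

# improvement of a split = parent deviance minus summed child deviances,
# i.e. the likelihood ratio statistic for comparing the two Poisson samples
dev_improvement <- function(y, left) {
  pois_dev(y) - (pois_dev(y[left]) + pois_dev(y[!left]))
}
```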

fit.prune <- prune(sfit, cp = 0.1)
printcp(fit.prune)
## 
## Rates regression tree:
## rpart(formula = skips ~ Opening + Solder + Mask + PadType + Panel, 
##     data = solder, method = "poisson", control = rpart.control(cp = 0.004, 
##         maxcompete = 2))
## 
## Variables actually used in tree construction:
## [1] Mask    Opening
## 
## Root node error: 8788.2/900 = 9.7647
## 
## n= 900 
## 
##        CP nsplit rel error  xerror     xstd
## 1 0.30379      0   1.00000 1.00270 0.052110
## 2 0.15408      1   0.69621 0.69973 0.032698
## 3 0.12848      2   0.54214 0.54514 0.025275
## 4 0.10000      3   0.41366 0.41711 0.019411
prp(fit.prune,extra=101, # display the number of observations that fall in the node
    branch=.5, # change angle of branch lines
    shadow.col="gray", # shadows under the leaves
    branch.lty=3, # draw branches using dotted lines
    split.cex=1.2, # make the split text larger than the node text
    split.prefix="is ", # put "is " before split text
    split.suffix="?", # put "?" after split text
    split.box.col="lightgray", # lightgray split boxes (default is white)
    split.border.col="darkgray", # darkgray border on split boxes
    split.round=.5,
    nn=T) # display the node numbers, default is FALSE

The function prune() trims the fitted tree to the complexity level cp = 0.10.
The same tree could have been created by specifying cp = 0.10 in the original call to rpart().
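This equivalence can be verified directly (a sketch, assuming solder and fit.prune from above):

```r
# Grow the tree with cp = 0.10 from the start and compare the splits
# with the pruned tree
directFit <- rpart(skips ~ Opening + Solder + Mask + PadType + Panel,
                   data = solder, method = "poisson",
                   control = rpart.control(cp = 0.1))
identical(as.character(fit.prune$frame$var), as.character(directFit$frame$var))
```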

Prepare a function for calculating the root mean square error (RMSE) of a model.
Let x be the vector of residuals of the model.
Then the RMSE is calculated as:

rmse <- function(x) sqrt(mean(x^2))

Use rmse() to calculate a measure of fit of the pruned tree.

treeRmse<-rmse(resid(fit.prune))

Calculate the probability of a zero count under the Poisson tree model for row 6 of the data frame solder.

The function predict() returns the estimated intensity of the Poisson process.

(preTree<-predict(fit.prune,newdata=solder[6,]))
##        6 
## 1.032889
(pred0Tree<-dpois(0,lambda=preTree))
## [1] 0.3559772
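For a Poisson distribution the zero-count probability has the closed form \(e^{-\lambda}\); as a quick check:

```r
lambda <- 1.032889                         # rate predicted by the tree for row 6
all.equal(dpois(0, lambda), exp(-lambda))  # TRUE
```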

1.4.1 Exploring switches on the tree

In this section we explore how the rules of the regression tree apply to each row of observations.

library(plyr)
library(data.tree)
library(partykit)

Prepare a function that creates a matrix of all paths from the root to each leaf of a given tree.

rpartPaths<-function(x){
  require(plyr)
  require(partykit)
  require(rpart)
  require(data.tree)
  dtTree<-as.Node(as.party(x)) # transform object into data.tree
  myPaths<-lapply(dtTree$leaves,     # find all paths from root to leaf on the tree
                  function(z) t(as.matrix(as.numeric(FindNode(dtTree,z$name)$path))))
  myPaths<-rbind.fill.matrix(myPaths)  # bind paths of different lengths, padding with NA
  myPaths
}

Analyze the unpruned tree, which has the following configuration:

plot(as.Node(as.party(sfit)))

Or:

plot(as.Node(as.party(sfit)),output = 'visNetwork')

Create matrix of paths.

(pts<-rpartPaths(sfit))
##       1  2  3  4  5  6  7
##  [1,] 1  2  3  4  5  6 NA
##  [2,] 1  2  3  4  5  7 NA
##  [3,] 1  2  3  4  8 NA NA
##  [4,] 1  2  3  9 10 NA NA
##  [5,] 1  2  3  9 11 12 NA
##  [6,] 1  2  3  9 11 13 14
##  [7,] 1  2  3  9 11 13 15
##  [8,] 1  2 16 17 18 19 NA
##  [9,] 1  2 16 17 18 20 NA
## [10,] 1  2 16 17 21 NA NA
## [11,] 1  2 16 22 23 24 NA
## [12,] 1  2 16 22 23 25 NA
## [13,] 1  2 16 22 26 NA NA
## [14,] 1 27 28 29 30 NA NA
## [15,] 1 27 28 29 31 NA NA
## [16,] 1 27 28 32 33 NA NA
## [17,] 1 27 28 32 34 NA NA
## [18,] 1 27 35 36 37 NA NA
## [19,] 1 27 35 36 38 39 NA
## [20,] 1 27 35 36 38 40 NA
## [21,] 1 27 35 41 42 43 NA
## [22,] 1 27 35 41 42 44 NA
## [23,] 1 27 35 41 45 46 NA
## [24,] 1 27 35 41 45 47 NA

Create vector of all terminal nodes.

(lvs<-apply(pts,1,function(z) tail(na.omit(z),1)))
##  [1]  6  7  8 10 12 14 15 19 20 21 24 25 26 30 31 33 34 37 39 40 43 44 46 47

Create a table of the numbers of observations in the terminal nodes.

(distrInLeaves<-table(sfit$where))
## 
##  6  7  8 10 12 14 15 19 20 21 24 25 26 30 31 33 34 37 39 40 43 44 46 47 
## 90 60 60 70 98 12 30 30 30 30 30 15 45 36 54 42 18 24 18 18  9 36 30 15
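As a sanity check (assuming the objects above are in the workspace), the leaf counts should add up to the number of observations and the leaf labels should match lvs:

```r
stopifnot(sum(distrInLeaves) == nrow(solder),
          all(as.integer(names(distrInLeaves)) == lvs))
```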

Create vector of predictions by unpruned tree.

Pred<-round(predict(sfit,newdata=solder),2)

Finally, collect in one matrix the data, predictions, terminal nodes, and entire paths for each observation.

completeTreeMatrix<-cbind(solder,Pred,Where=sfit$where,pts[match(sfit$where,lvs),])
head(completeTreeMatrix)
##   Opening Solder Mask PadType Panel skips Pred Where 1 2 3 4 5 6  7
## 1       L  Thick A1.5      W4     1     0 0.07     6 1 2 3 4 5 6 NA
## 2       L  Thick A1.5      W4     2     0 0.07     6 1 2 3 4 5 6 NA
## 3       L  Thick A1.5      W4     3     0 0.07     6 1 2 3 4 5 6 NA
## 4       L  Thick A1.5      D4     1     0 0.07     6 1 2 3 4 5 6 NA
## 5       L  Thick A1.5      D4     2     0 0.07     6 1 2 3 4 5 6 NA
## 6       L  Thick A1.5      D4     3     0 0.07     6 1 2 3 4 5 6 NA

Explore switches at different levels of the tree.

completeTreeMatrix[58:62,]
##    Opening Solder Mask PadType Panel skips Pred Where 1  2  3  4  5  6  7
## 58       M  Thick A1.5      L9     1     0 0.60     7 1  2  3  4  5  7 NA
## 59       M  Thick A1.5      L9     2     0 0.60     7 1  2  3  4  5  7 NA
## 60       M  Thick A1.5      L9     3     1 0.60     7 1  2  3  4  5  7 NA
## 61       S  Thick A1.5      W4     1     1 3.08    31 1 27 28 29 31 NA NA
## 62       S  Thick A1.5      W4     2     0 3.08    31 1 27 28 29 31 NA NA

At row 61 Opening switched from “M” to “S” and pad type switched from “L9” to “W4”. Both changes resulted in a jump of predicted values from 0.60 to 3.08.
The path on the tree changed from node 2 to node 27 at the second level.

completeTreeMatrix[88:92,]
##    Opening Solder Mask PadType Panel skips Pred Where 1  2  3  4  5  6  7
## 88       S  Thick A1.5      L9     1     1 1.05    30 1 27 28 29 30 NA NA
## 89       S  Thick A1.5      L9     2     1 1.05    30 1 27 28 29 30 NA NA
## 90       S  Thick A1.5      L9     3     1 1.05    30 1 27 28 29 30 NA NA
## 91       L   Thin A1.5      W4     1     1 0.73    10 1  2  3  9 10 NA NA
## 92       L   Thin A1.5      W4     2     1 1.45    12 1  2  3  9 11 12 NA

At row 91 the path switched back from node 27 to node 2 at level 2 as a result of changes in opening, solder, and pad type.

completeTreeMatrix[448:462,]
##     Opening Solder Mask PadType Panel skips Pred Where 1  2  3  4  5  6  7
## 448       S  Thick   A3      L9     1     0 1.05    30 1 27 28 29 30 NA NA
## 449       S  Thick   A3      L9     2     4 1.05    30 1 27 28 29 30 NA NA
## 450       S  Thick   A3      L9     3     4 1.05    30 1 27 28 29 30 NA NA
## 451       L   Thin   A3      W4     1     0 0.73    10 1  2  3  9 10 NA NA
## 452       L   Thin   A3      W4     2     2 1.45    12 1  2  3  9 11 12 NA
## 453       L   Thin   A3      W4     3     1 1.45    12 1  2  3  9 11 12 NA
## 454       L   Thin   A3      D4     1     0 0.73    10 1  2  3  9 10 NA NA
## 455       L   Thin   A3      D4     2     7 4.44    15 1  2  3  9 11 13 15
## 456       L   Thin   A3      D4     3     8 4.44    15 1  2  3  9 11 13 15
## 457       L   Thin   A3      L4     1     2 0.73    10 1  2  3  9 10 NA NA
## 458       L   Thin   A3      L4     2     2 4.44    15 1  2  3  9 11 13 15
## 459       L   Thin   A3      L4     3     3 4.44    15 1  2  3  9 11 13 15
## 460       L   Thin   A3      D6     1     0 0.73    10 1  2  3  9 10 NA NA
## 461       L   Thin   A3      D6     2     5 1.45    12 1  2  3  9 11 12 NA
## 462       L   Thin   A3      D6     3     3 1.45    12 1  2  3  9 11 12 NA

At row 451 there is a big change at level 2 as a result of changes in opening, solder, and pad type.

completeTreeMatrix[506:512,]
##     Opening Solder Mask PadType Panel skips  Pred Where 1  2  3  4  5  6  7
## 506       M   Thin   A6      W9     2     0  6.23    21 1  2 16 17 21 NA NA
## 507       M   Thin   A6      W9     3     1  6.23    21 1  2 16 17 21 NA NA
## 508       M   Thin   A6      L9     1     3  6.23    21 1  2 16 17 21 NA NA
## 509       M   Thin   A6      L9     2     7  6.23    21 1  2 16 17 21 NA NA
## 510       M   Thin   A6      L9     3    15  6.23    21 1  2 16 17 21 NA NA
## 511       S   Thin   A6      W4     1    16 27.00    46 1 27 35 41 45 46 NA
## 512       S   Thin   A6      W4     2    34 27.00    46 1 27 35 41 45 46 NA

At row 511 there is a big change as a result of switches in opening and pad type.

It is possible to select the pruning level based on the intensity of switches.

apply(completeTreeMatrix[,10:15],2,
      function(z) sum(diff(na.omit(z))!=0)/length(na.omit(z)))
##          2          3          4          5          6          7 
## 0.02111111 0.02333333 0.05888889 0.25444444 0.13435701 0.02380952

The switching intensity jumps at level 5.
Plot the switches at each level.

plot(completeTreeMatrix[,10],type="l",main="Level 2",ylab="Node #")

plot(completeTreeMatrix[,11],type="l",main="Level 3",ylab="Node #")

plot(completeTreeMatrix[,12],type="l",main="Level 4",ylab="Node #")

plot(completeTreeMatrix[,13],type="l",main="Level 5",ylab="Node #")

plot(completeTreeMatrix[,14],type="l",main="Level 6",ylab="Node #")

plot(completeTreeMatrix[,15],type="l",main="Level 7",ylab="Node #")

Note that at level 6 there is a big switch at row 511 from node 21 to node 46, resulting in a change in predicted counts from 6.23 to 27. But this level is still not necessary because, closer to the root at level 2, there is a switch from node 2 to node 27.

Prune the tree to 4 levels.
This corresponds to a cp level of 0.0287271.

printcp(sfit)
## 
## Rates regression tree:
## rpart(formula = skips ~ Opening + Solder + Mask + PadType + Panel, 
##     data = solder, method = "poisson", control = rpart.control(cp = 0.004, 
##         maxcompete = 2))
## 
## Variables actually used in tree construction:
## [1] Mask    Opening PadType Panel   Solder 
## 
## Root node error: 8788.2/900 = 9.7647
## 
## n= 900 
## 
##           CP nsplit rel error  xerror      xstd
## 1  0.3037885      0   1.00000 1.00270 0.0521095
## 2  0.1540759      1   0.69621 0.69973 0.0326983
## 3  0.1284762      2   0.54214 0.54514 0.0252752
## 4  0.0428654      3   0.41366 0.41711 0.0194107
## 5  0.0287271      4   0.37079 0.37616 0.0173988
## 6  0.0272254      5   0.34207 0.36314 0.0171739
## 7  0.0183899      6   0.31484 0.32327 0.0151030
## 8  0.0156026      7   0.29645 0.31862 0.0148515
## 9  0.0125093      8   0.28085 0.29398 0.0136167
## 10 0.0089906     10   0.25583 0.27232 0.0124124
## 11 0.0085255     11   0.24684 0.26565 0.0117998
## 12 0.0066697     13   0.22979 0.25594 0.0114715
## 13 0.0063814     14   0.22312 0.25094 0.0113376
## 14 0.0060658     15   0.21674 0.24973 0.0112412
## 15 0.0059467     16   0.21067 0.24438 0.0109590
## 16 0.0058737     17   0.20473 0.24440 0.0109517
## 17 0.0055449     18   0.19885 0.23646 0.0107538
## 18 0.0050760     19   0.19331 0.23148 0.0105358
## 19 0.0045152     20   0.18823 0.21829 0.0099479
## 20 0.0044702     21   0.18372 0.21508 0.0098090
## 21 0.0044302     22   0.17925 0.21459 0.0099189
## 22 0.0040000     23   0.17482 0.20869 0.0096200
prunedFit4<-prune(sfit,cp=0.0287271)
prp(prunedFit4,extra=101, # display the number of observations that fall in the node
    branch=.5, # change angle of branch lines
    shadow.col="gray", # shadows under the leaves
    branch.lty=3, # draw branches using dotted lines
    split.cex=1.2, # make the split text larger than the node text
    split.prefix="is ", # put "is " before split text
    split.suffix="?", # put "?" after split text
    split.box.col="lightgray", # lightgray split boxes (default is white)
    split.border.col="darkgray", # darkgray border on split boxes
    split.round=.5,
    nn=T) # display the node numbers, default is FALSE

plot(as.Node(as.party(prunedFit4)))
plot(as.Node(as.party(prunedFit4)),output = 'visNetwork')

1.4.2 Negative binomial regression model

Recall the example with these data in the workshop Linear and Nonlinear Models.
There we showed that Poisson regression does not fit these data well: there is significant overdispersion.
Instead, we used negative binomial regression.
Fit the negative binomial regression and compare the residuals of both models.

modnb <- glm.nb(skips ~ .,solder)
summary(modnb)
## 
## Call:
## glm.nb(formula = skips ~ ., data = solder, init.theta = 4.52811339, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7047  -1.0109  -0.3921   0.4480   2.8869  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.29859    0.13202  -9.837  < 2e-16 ***
## OpeningM     0.50353    0.07932   6.348 2.18e-10 ***
## OpeningS     1.91104    0.07111  26.876  < 2e-16 ***
## SolderThin   0.93513    0.05323  17.567  < 2e-16 ***
## MaskA3       0.58383    0.09592   6.087 1.15e-09 ***
## MaskA6       2.26096    0.10101  22.384  < 2e-16 ***
## MaskB3       1.20651    0.09572  12.605  < 2e-16 ***
## MaskB6       1.98172    0.09158  21.638  < 2e-16 ***
## PadTypeD6   -0.46189    0.11145  -4.144 3.41e-05 ***
## PadTypeD7   -0.03182    0.10584  -0.301 0.763655    
## PadTypeL4    0.38119    0.10177   3.745 0.000180 ***
## PadTypeL6   -0.57860    0.11327  -5.108 3.25e-07 ***
## PadTypeL7   -0.36569    0.11006  -3.323 0.000891 ***
## PadTypeL8   -0.15882    0.10734  -1.480 0.138953    
## PadTypeL9   -0.56554    0.11306  -5.002 5.67e-07 ***
## PadTypeW4   -0.19851    0.10783  -1.841 0.065630 .  
## PadTypeW9   -1.56332    0.13538 -11.547  < 2e-16 ***
## Panel2       0.29574    0.06322   4.678 2.90e-06 ***
## Panel3       0.33380    0.06298   5.300 1.16e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(4.5281) family taken to be 1)
## 
##     Null deviance: 4097.6  on 899  degrees of freedom
## Residual deviance: 1012.1  on 881  degrees of freedom
## AIC: 3679.5
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  4.528 
##           Std. Err.:  0.518 
## 
##  2 x log-likelihood:  -3639.514
matplot(1:length(modnb$residuals),cbind(modnb$residuals,resid(sfit)),type="p",pch=c(1,19),
        xlab="Index",ylab="Residuals",main="Comparison of Tree Regression and NB Regression")
legend("bottomleft",legend=c("NB","Tree"),col=c("black","red"),pch=c(1,19))

What is the indication of overdispersion in the summary of the model?

Calculate the root mean square error of the negative binomial model.

nbRmse<-rmse(modnb$residuals)

Calculate the probability of a zero count under the negative binomial model for row 6 of the data frame solder.

mu<-predict(modnb,solder[6,],type="response")
pred0NB<-dnbinom(0,mu=mu,size=modnb$theta)
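The zero-count probability of the negative binomial also has a closed form, \((\theta/(\theta+\mu))^{\theta}\). A quick check with illustrative values (mu and theta here are stand-ins for the fitted quantities above):

```r
mu <- 0.5; theta <- 4.528
all.equal(dnbinom(0, mu = mu, size = theta), (theta / (theta + mu))^theta)  # TRUE
```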

Compare the RMSE and the probabilities of zero count for row 6 of the data given by the two models.

c(Tree_RMSE=treeRmse,NB_RMSE=nbRmse)
## Tree_RMSE   NB_RMSE 
##  5.815413  1.068110
c(tree_0Probability=pred0Tree,NB_0Probability=pred0NB)
## tree_0Probability   NB_0Probability 
##         0.3559772         0.6935889

2 Time series of stock prices

Predict returns of the exchange-traded fund SPY, representing the S&P 500, with a group of stock returns of companies in the index.

Select year 2014.

dataPath<-"Your path to the folder with the data"
SPYPortf<-read.csv(paste(dataPath,"spyPortfolio.csv",sep="/"))
head(SPYPortf,3)
##     SPLS.A    MTB.A    UNM.A    VLO.A AMZN.A ADBE.A    CSX.A     PG.A    CMA.A
## 1 13.35805 106.3561 31.95808 44.67762 397.97  59.29 26.06945 71.50668 44.10851
## 2 13.52942 106.4948 31.96739 44.21176 396.44  59.16 26.23555 71.42678 44.32504
## 3 13.13528 106.1620 31.92084 44.64179 393.63  58.12 26.04176 71.59547 44.24972
##      PEP.A    DNB.A   MDLZ.A    SYY.A   VIAB.A    MDT.A      T.A    CHK.A
## 1 74.31774 113.9543 32.85969 32.90575 79.78655 53.40803 28.78814 24.55435
## 2 74.44447 115.0964 32.80304 33.02484 79.46537 54.43439 28.66459 24.36987
## 3 74.48068 113.7090 32.59536 32.97904 78.66707 55.25548 28.79637 24.16694
##      MRO.A    TMK.A  MU.A AABA.A    MON.A AKAM.A    HSY.A    BAC.A    WDC.A
## 1 32.30710 51.62000 21.66  39.59 108.2317  46.53 88.48643 15.44962 75.43743
## 2 31.94617 51.72000 20.97  40.12 108.2690  46.45 88.37573 15.74710 75.98487
## 3 31.86289 51.53333 20.67  39.93 107.6455  46.11 88.12668 15.98700 75.62904
##       EL.A    SPY.A
## 1 70.42357 170.5143
## 2 70.36625 170.4863
## 3 70.77708 169.9923

Create daily log returns of all stocks and SPY.
Make daily log returns of all stock prices lagged one day relative to the daily log returns of SPY.

SPYPortf<-log(SPYPortf)
SPYPortf<-apply(SPYPortf,2,diff)
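If one wants the stock returns to lead SPY by one day, as described above, the alignment can be sketched as follows (a sketch under that assumption; laggedPortf is a hypothetical name):

```r
# Align stock returns at day t-1 with the SPY return at day t
# (drops one observation; column 28 is SPY)
n <- nrow(SPYPortf)
laggedPortf <- cbind(SPYPortf[-n, -28, drop = FALSE],
                     SPY = SPYPortf[-1, 28])
```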

Calculate meta-features as PCA factors.

SPYPortfPCA<-princomp(SPYPortf[,-28])
SPYPoptf.factors<-SPYPortfPCA$scores[,1:3]
SPYPortfFactors<-as.data.frame(cbind(SPYPoptf.factors,SPY=SPYPortf[,28]))
head(SPYPortfFactors,3)
##         Comp.1       Comp.2       Comp.3           SPY
## 1 -0.006240906  0.013539045  0.017551027 -0.0001639528
## 2 -0.025357058 -0.008313471 -0.023439077 -0.0029021754
## 3  0.066077676  0.012254176  0.009869272  0.0061230653
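The share of total variance captured by the three retained components can be checked with (a sketch, assuming SPYPortfPCA from above):

```r
pcVar <- SPYPortfPCA$sdev^2
round(cumsum(pcVar)[1:3] / sum(pcVar), 3)  # cumulative proportion of variance explained
```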

Grow regression tree.

tsfit <- rpart(SPY ~.,data = SPYPortfFactors, method ='anova',
              control = rpart.control(cp = 0.001))
printcp(tsfit)
## 
## Regression tree:
## rpart(formula = SPY ~ ., data = SPYPortfFactors, method = "anova", 
##     control = rpart.control(cp = 0.001))
## 
## Variables actually used in tree construction:
## [1] Comp.1 Comp.2 Comp.3
## 
## Root node error: 0.012407/250 = 4.9629e-05
## 
## n= 250 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.5154147      0   1.00000 1.00585 0.119479
## 2  0.1411214      1   0.48459 0.50674 0.060971
## 3  0.1041148      2   0.34346 0.42592 0.057085
## 4  0.0482427      3   0.23935 0.26815 0.033231
## 5  0.0164834      4   0.19111 0.22097 0.021100
## 6  0.0162673      5   0.17462 0.21255 0.020640
## 7  0.0068456      6   0.15836 0.18700 0.018610
## 8  0.0037203      7   0.15151 0.19239 0.019007
## 9  0.0033124      8   0.14779 0.19240 0.019065
## 10 0.0025532      9   0.14448 0.19534 0.019077
## 11 0.0022440     10   0.14192 0.19231 0.019065
## 12 0.0021753     11   0.13968 0.19647 0.019319
## 13 0.0017223     12   0.13750 0.19724 0.019229
## 14 0.0014699     13   0.13578 0.19256 0.019074
## 15 0.0011689     14   0.13431 0.19339 0.019336
## 16 0.0010041     15   0.13314 0.19107 0.019345
## 17 0.0010000     16   0.13214 0.19104 0.019316
prp(tsfit,extra=101, # display the number of observations that fall in the node
    branch=.5, # change angle of branch lines
    shadow.col="gray", # shadows under the leaves
    branch.lty=3, # draw branches using dotted lines
    split.cex=1.2, # make the split text larger than the node text
    split.prefix="is ", # put "is " before split text
    split.suffix="?", # put "?" after split text
    split.box.col="lightgray", # lightgray split boxes (default is white)
    split.border.col="darkgray", # darkgray border on split boxes
    split.round=.5,
    nn=TRUE) # display the node numbers, default is FALSE

Prune the tree.

(best.CP = tsfit$cptable[which.min(tsfit$cptable[,"xerror"]),"CP"])
## [1] 0.006845631
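A common alternative is the 1-SE rule: pick the simplest tree whose xerror is within one standard error of the minimum. A sketch using the cptable:

```r
cpt <- tsfit$cptable
iMin <- which.min(cpt[, "xerror"])
threshold <- cpt[iMin, "xerror"] + cpt[iMin, "xstd"]
(oneSE.CP <- cpt[min(which(cpt[, "xerror"] <= threshold)), "CP"])
```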

In order to make predictions less trivial, prune with a cp a little smaller than suggested by best.CP.

prunedTsFit<-prune(tsfit,cp=0.0037203)
prp(prunedTsFit,extra=101, # display the number of observations that fall in the node
    branch=.5, # change angle of branch lines
    shadow.col="gray", # shadows under the leaves
    branch.lty=3, # draw branches using dotted lines
    split.cex=1.2, # make the split text larger than the node text
    split.prefix="is ", # put "is " before split text
    split.suffix="?", # put "?" after split text
    split.box.col="lightgray", # lightgray split boxes (default is white)
    split.border.col="darkgray", # darkgray border on split boxes
    split.round=.5,
    nn=TRUE) # display the node numbers, default is FALSE

Interpret the tree.

Create matrix of paths.

(tsSPY<-rpartPaths(prunedTsFit))
##      1 2  3  4  5
## [1,] 1 2  3 NA NA
## [2,] 1 2  4  5 NA
## [3,] 1 2  4  6 NA
## [4,] 1 7  8  9 NA
## [5,] 1 7  8 10 NA
## [6,] 1 7 11 12 13
## [7,] 1 7 11 12 14
## [8,] 1 7 11 15 NA
plot(as.Node(as.party(prunedTsFit)))

Create vector of all terminal nodes.

(lvsSPY<-apply(tsSPY,1,function(z) tail(na.omit(z),1)))
## [1]  3  5  6  9 10 13 14 15

Create table of numbers of observations in terminal nodes

table(prunedTsFit$where)
## 
##  3  5  6  9 10 13 14 15 
## 17 18 34 67 55  7 43  9

Create vector of predictions by pruned tree.

PredSPY<-round(predict(prunedTsFit,newdata=SPYPortfFactors),6)

Finally, collect in one matrix the data, predictions, terminal nodes, and entire paths for each observation.

completeMatrixSPY<-cbind(round(SPYPortfFactors,4),PredSPY,Where=prunedTsFit$where,
                         tsSPY[match(prunedTsFit$where,lvsSPY),])
head(completeMatrixSPY,7)
##    Comp.1  Comp.2  Comp.3     SPY   PredSPY Where 1 2  3  4  5
## 1 -0.0062  0.0135  0.0176 -0.0002  0.000370     9 1 7  8  9 NA
## 2 -0.0254 -0.0083 -0.0234 -0.0029 -0.003700     6 1 2  4  6 NA
## 3  0.0661  0.0123  0.0099  0.0061  0.007353    14 1 7 11 12 14
## 4  0.0376  0.0217 -0.0376  0.0002  0.007353    14 1 7 11 12 14
## 5 -0.0090 -0.0038 -0.0248  0.0007  0.000370     9 1 7  8  9 NA
## 6  0.0144 -0.0041 -0.0161  0.0027  0.002972    10 1 7  8 10 NA
## 7 -0.0906 -0.0104 -0.0199 -0.0134 -0.015175     3 1 2  3 NA NA

Analyze the pattern of the first 7 days in the period.

All leaves are ordered by the size of the predicted return: from -0.015 in node 3, to almost 0 in node 9, to 0.016 in node 15.

Rows 3, 4, 5, and 6 ended in nodes with positive predicted returns.
Row 7 ended in a node with a negative predicted return.
Was it possible to see it coming?

First, note that the big divide is between nodes 2 and 7. On the side of node 2 the return is always negative; on the side of node 7 it is zero or positive.

In rows 3-6 the trajectories go through node 7, and in row 7 there is a switch to node 2.

While going through node 7, the trajectories in rows 3 and 4 passed through node 11, which guarantees a positive return. But on day 5 the trajectory switched from node 11 to node 8, which can lead only to one slightly positive node (10) and one neutral node (9).

On days 5 and 6 the trajectories went through node 8 to non-negative terminal nodes (9, 10), but the process was only one switch away from turning to the negative side, which happened on day 7.

Check the column SPY in the table.
Actual returns mostly had the same signs as the predictions.

In general, the tree regression model predicted actual returns reasonably well.

plot(completeMatrixSPY[,"PredSPY"],type="l",col="blue",lwd=3,
     ylim=range(completeMatrixSPY[,"SPY"]))
lines(completeMatrixSPY[,"SPY"],col="orange")
legend("top",legend=c("Predicted","Actual"),col=c("blue","orange"),lwd=c(3,1))

plot(completeMatrixSPY[,"PredSPY"],completeMatrixSPY[,"SPY"])
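The agreement between predicted and actual returns can be quantified by their correlation (a sketch, assuming completeMatrixSPY from above):

```r
cor(completeMatrixSPY[, "PredSPY"], completeMatrixSPY[, "SPY"])
```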

2.1 Example. Large Number of Predictors

Consider the example analysed in the workshop Linear Regression with Large Number of Predictors.

Simulate data

N = 500
set.seed(0)
Epsilon<-rnorm(N,0,1)
X<-rnorm(N*N,0,2)
dim(X)<-c(N,N)
colnames(X)<-paste0("X",1:N)
slopesSet<-runif(N,1,3)
Y<-sapply(2:N,function(z) 1+X[,1:z]%*%slopesSet[1:z]+Epsilon)
head(X[,1:5])
##              X1         X2         X3          X4         X5
## [1,] -1.2527365 -0.5737031  0.1575234  0.88869164 -1.0643804
## [2,]  0.9626707  3.6822138 -0.1028412  0.02385876 -0.7399334
## [3,]  3.3905422 -0.3135286 -0.5215483 -0.01856009  2.3417986
## [4,] -3.5224526 -2.7796053  3.1325364 -0.60475511 -0.9291356
## [5,]  0.3960260 -2.9462080 -0.7440302  0.98471004 -0.6042920
## [6,]  0.7946982 -0.1390379  3.4838889 -1.20543924  2.7827418

Fit linear regression model with 440 regressors.

m = 440
completeModelDataFrame <- data.frame(Y=Y[,m-1],X[,1:m])
m440 <- lm(Y~.,data=completeModelDataFrame)

We reduced the number of regressors in completeModelDataFrame in order to ensure that on each iteration of 10-fold cross validation the number of regressors is less than the number of observations.
The residual standard error of the model is

summary(m440)$sigma
## [1] 0.9167553

Apply regression tree to these data.

treeFit <- rpart(Y~., data=completeModelDataFrame)
printcp(treeFit)
## 
## Regression tree:
## rpart(formula = Y ~ ., data = completeModelDataFrame)
## 
## Variables actually used in tree construction:
##  [1] X106 X113 X131 X157 X166 X183 X2   X204 X221 X232 X248 X275 X285 X31  X325
## [16] X326 X335 X350 X354 X356 X362 X375 X382 X418 X47  X66  X68  X71  X78  X86 
## 
## Root node error: 3764475/500 = 7528.9
## 
## n= 500 
## 
##          CP nsplit rel error xerror     xstd
## 1  0.040633      0   1.00000 1.0027 0.059277
## 2  0.039101      1   0.95937 1.1287 0.067913
## 3  0.034782      2   0.92027 1.1412 0.068250
## 4  0.030455      3   0.88548 1.2193 0.073437
## 5  0.028779      4   0.85503 1.3002 0.076320
## 6  0.027717      5   0.82625 1.3399 0.077881
## 7  0.027244      6   0.79853 1.3550 0.078162
## 8  0.026064      8   0.74404 1.3717 0.078681
## 9  0.024455     10   0.69192 1.3900 0.079635
## 10 0.023939     11   0.66746 1.3994 0.081147
## 11 0.021970     13   0.61959 1.4132 0.082570
## 12 0.020786     14   0.59762 1.4611 0.086584
## 13 0.020574     15   0.57683 1.4632 0.086215
## 14 0.019686     16   0.55626 1.4712 0.086204
## 15 0.019472     17   0.53657 1.4673 0.086224
## 16 0.018328     18   0.51710 1.5050 0.089524
## 17 0.017346     19   0.49877 1.5116 0.089904
## 18 0.016112     20   0.48142 1.5396 0.091424
## 19 0.016005     21   0.46531 1.5422 0.091847
## 20 0.014976     22   0.44931 1.5513 0.092380
## 21 0.014302     23   0.43433 1.5635 0.093042
## 22 0.013641     24   0.42003 1.5920 0.093758
## 23 0.012147     25   0.40639 1.5811 0.094859
## 24 0.011826     26   0.39424 1.6007 0.094303
## 25 0.011815     27   0.38242 1.6008 0.094356
## 26 0.011456     28   0.37060 1.6014 0.094335
## 27 0.011242     29   0.35914 1.6032 0.094611
## 28 0.010000     30   0.34790 1.6144 0.094939
plotcp(treeFit)

The outputs of printcp and plotcp show that trees cannot handle these data at all.
Although the relative error can be reduced to about 35% of the root error, xerror (the cross-validation error) increases almost monotonically.
This means that any split implies overfitting, and the best tree-based prediction model is the constant one.
The RMSE of the constant prediction is

sd(completeModelDataFrame$Y)
## [1] 86.85642
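The constant prediction is the mean of Y, so its RMSE is essentially sd(Y); using the rmse() helper defined earlier:

```r
rmse(completeModelDataFrame$Y - mean(completeModelDataFrame$Y))
```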

Use the package caret to run cross-validation.

ctrl <- trainControl(method = "cv", number = 10)
lmTrain <- train(Y~.,data=completeModelDataFrame,
                   method = 'lm', trControl = ctrl)
lmTrain$results
##   intercept   RMSE  Rsquared      MAE   RMSESD  RsquaredSD    MAESD
## 1      TRUE 6.4678 0.9939519 5.234991 1.898774 0.004034775 1.503631

The linear method displayed much better predictive quality than trees in this example.
Trees could not deal with this complex model with so many predictors.

3 Discussion

Claudia Perlich et al. presented a large-scale experimental comparison of tree and linear model methods in the paper Tree Induction vs. Logistic Regression: A Learning-Curve Analysis.
The authors examined several dozen large two-class data sets, ranging from roughly one thousand to two million examples.
The results of the study show the following.